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Executive Summary 

Educator performance evaluation systems are a potential tool for improving student 
achievement. 1 By removing ineffective teachers and principals and/or through increasing the 
effectiveness of the existing workforce, such systems may result in higher student achievement. 2 

Emerging research suggests some promising features of performance evaluation measures. For 
example, research suggests that, to measure classroom practice, additional observations of the 
same teacher, beyond the first, form a more reliable measure of a teacher’s typical practice, 
especially when more than one observer is used. 3 There is also some evidence from recent 
research that giving more frequent, specific feedback on classroom practice may lead to 
improvements in teacher performance and student achievement. 4 

The U.S. Department of Education’s Institute of Education Sciences is conducting a study on the 
implementation and impacts of teacher and principal performance measures that are consistent 
with emerging research. As part of the study, eight districts were provided resources and support 
to implement the following three performance measures in a selected sample of schools in 20 12- 
Band 2013-14: 

• a measure of teacher classroom practice with subsequent feedback sessions conducted 
four times per year, based on a classroom observation rubric; 

• a measure of teacher contributions to student achievement growth (i.e., value-added 
scores), provided to teachers and their principals once per year; and 

• a measure of principal leadership with subsequent feedback sessions conducted twice per 
year, using a leadership survey administered to the principal, the principal’s supervisor, 
and the principal’s teachers. 5 

No formal “stakes” were attached to the measures - for example, they were not used by the study 
districts for staffing decisions such as tenure or continued employment. 6 Instead, the measures 
were used to provide educators and their supervisors with information regarding performance. 


1 See Stecher et al. (2016); Weisburg, Daniel, Sexton, Mulhern, and Keeling, (2009). 

2 Researchers studying a range of educator workforce interventions theorize that there are two key mechanisms 
leading to improved student achievement: changes in the composition of the workforce and changes in the skills of 
continuing workers. See, for example, Taylor and Tyler (2012) regarding teacher evaluation, Chiang et al. (2015) 
regarding performance-based compensation, and Glazerman et al. (2010) regarding comprehensive teacher 
induction. 

3 See Bill & Melinda Gates Foundation (2012); Whitehurst, Chingos, and Lindquist (2014). 

4 See Steinberg and Sartain (In Press); Taylor and Tyler (2012). 

5 These measures are the kinds of measures emphasized through the Elementary and Secondary Education Act 
flexibility waivers, as well as federal programs such as Race to the Top and the Teacher Incentive Fund. They also 
are the kinds of measures that states are allowed to develop and support with Title II, Part A funds under the Every 
Student Succeeds Act of 2015. 

6 There were exceptions in three districts. In these districts, the observations conducted by principals as part of this 
study counted in their official rating system if the teacher was due to be observed that year under the district’s 
existing evaluation system. 
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Such information might affect perceptions about perfonnance, motivate improvement, and/or 
lead to improved knowledge and skills among educators. 

This report focuses on the first year of the two years of implementation, describing the 
characteristics of the educator performance measures and teachers’ and principals’ experiences 
with performance feedback. It is one of the few detailed descriptions of educator perfonnance 
measures implemented on a large scale in districts. 7 The final report will primarily focus on 
impacts on outcomes including principal leadership, teacher classroom practice, and student 
achievement but will also include descriptive infonnation on the second year of implementation. 

The main findings in this report are as follows: 

• Educator performance measures were implemented generally as planned, 
except that fewer than the intended number of educators accessed the 
student growth reports. Most teachers received the intended four rounds of 
observations and feedback sessions (mean = 3.8 observations and 3.7 feedback sessions 
per teacher) by trained and certified observers. Only 40 percent of principals and 39 
percent of teachers with value-added scores accessed their student growth reports. All 
principals in the implementing schools received two reports about their leadership 
practices based on their survey responses and that of their teachers and their supervisor. 
They also met with their supervisors to discuss their reports, in both fall and spring. 

• Both classroom observation and student growth measures differentiated 
teacher performance, but observation scores were skewed toward the 
upper end of the scale. In each classroom observation window, a large majority of 
the teachers observed had classroom observation overall scores in the top two 
perfonnance levels (more than 85 percent), and very few teachers had overall scores in 
the lower two levels (less than 15 percent). However, there was some variation in scores 
across teachers, and both value-added scores and average classroom observation scores 
over the year had sufficient reliability to capture perfonnance differences among some 
teachers. About one quarter of the teachers with value-added scores in reading/English 
language arts and one half of the teachers with value-added scores in mathematics had 
student growth reports indicating that their score in that subject was measurably below or 
above the district average. 8 

• The principal leadership measure differentiated performance, but there was 
limited consistency in scores across survey respondent groups. Principals’ 
overall leadership scores were spread across the four performance levels. Half or more of 


7 For an example, see Lipscomb, Terziev, and Chaplin (2015). 

8 To test whether teachers’ value-added scores differed from the district average, we used an 80 percent confidence 
interval. There were two reasons for the use of an 80 percent rather than a 95 percent confidence interval more 
typical of statistical tests. First, the student growth report available to principals and teachers in the intervention 
schools included the score with an 80 percent confidence interval. As part of the student growth report training, 
educators were told to use this confidence interval to determine if their score differed from average performance in 
the district. Second, the value-added scores were used for informational purposes and not decisions with 
consequences for employment, for example. Given this, the 80 percent confidence interval was intended to balance 
the risks associated with mistakenly classifying average performers as above or below average with the risks 
associated with mistakenly classifying above or below average performers as average. 
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principals in the schools selected to implement the intervention received overall scores on 
the principal leadership measure in the lower two categories labeled basic or below basic 
(70 percent in the fall and 5 1 percent in the spring administration). However, the 
respondent groups (principals, teachers, and principal supervisors) often scored principals 
differently. 

• Both teachers and principals in schools selected to implement the 

intervention reported receiving more feedback than those in schools in the 
same districts selected to continue with business-as-usual. For example, 
teachers in intervention schools reported more feedback sessions that were accompanied 
by a rating and written narrative (3.0 versus 0.7 sessions). 9 They also reported spending 
more total time in feedback sessions (80 minutes versus 18 minutes). Principals in 
intervention schools reported more instances of receiving feedback with ratings (1.0 
versus 0.4) and spending more total time in feedback sessions (60 minutes versus 41 
minutes). 

Study Overview 

The purpose of this study is to describe teachers’ and principals’ experiences with the study’s 
performance measures and feedback over two years, and to examine whether the infonnation 
provided by the measures and feedback affected educator and student outcomes. The study has 
five research questions: 

1 . To what extent were the performance measures and feedback implemented as planned? 

2. To what extent did the performance measures distinguish educator perfonnance? 

3. To what extent did educators’ experiences with perfonnance feedback differ for 
treatment and control schools? 

4. Did the intervention have an impact on teacher classroom practice and principal 
leadership? 

5. Did the intervention have an impact on student achievement? 

This report addresses the first three questions, focusing on the first year of implementation. A 
subsequent report will address the first three questions focusing on the second year of 
implementation. That report will also address the last two questions. 

Study Design 

The study examines the implementation and impacts of an intervention consisting of three 
perfonnance measures with feedback for teachers and principals through an experimental design 
conducted in eight purposefully selected districts. We recruited districts that met the following 
criteria: (1) had at least 20 elementary and middle schools, (2) had data systems that were 
sufficient to support value-added analysis, and (3) had current performance measures and 
feedback that were less systematic and intensive than that implemented as part of the study. 
Recruited districts required fewer than four observations of teachers per year. In addition, the 


9 Throughout this executive summary, all treatment-control differences noted are statistically significant using two- 
tailed tests and applying a 5 percent critical value, unless noted otherwise. 
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districts’ evaluation systems did not require the inclusion of student achievement infonnation in 
teacher ratings. None of the recruited districts used a leadership measure similar to that used by 
the study. 

The study used two different observation measures to make the study findings more broadly 
relevant than would be the case if only one measure was used. Four of the eight study districts 
used the Classroom Assessment and Scoring System (CLASS) and the other four study districts 
used Charlotte Danielson’s Framework for Teaching (FFT). The observation rubrics were not 
randomly assigned but rather assigned based on district preference. Thus, differences in the study 
results in the CLASS and FFT districts cannot necessarily be attributed to the CLASS and FFT 
observation systems; differences could occur due to other district characteristics. 

Each study district identified a set of regular elementary and middle schools that were willing to 
participate in the study. In these schools, the study focuses on the teachers of mathematics and 
reading/English language arts in grades 4-8, as well as the principals. 10 The schools were 
assigned by lottery to implement the three measures with feedback (the treatment group) or not 
(the control group). Both groups continued to implement their district’s existing performance 
evaluations and measures, and the treatment group additionally implemented the study’s 
perfonnance measures with feedback. In total, 63 treatment schools and 64 control schools 
participated in the study. 

Consistent with the recruitment criteria, the study districts are larger and more likely to be urban 
than the average U.S. district. The study schools were similar to schools in the national 
population in terms of enrollment and Title I status, but on average had a higher percentage of 
students who were minorities. 

Data for this report came from multiple sources as described next. 

Data on the implementation of the intervention. We documented attendance at 
orientation and training events related to the study’s perfonnance measures. Online system 
records maintained by the vendors of the measures were used for infonnation on observer 
certification test pass rates, the frequency and timing of teacher observations and feedback 
sessions, and teachers’ and principals’ accessing of student growth reports. Surveys of observers 
hired by the study and interviews with district officials provided further infonnation regarding 
the implementation of the observations and the district context, respectively. 

Data on measures of educator performance. Data on measures of teacher classroom 
practice, student growth, and principal leadership were collected through the vendors’ online 
systems. 

Data on educators’ experiences with performance feedback. In spring 2013, we 
surveyed both the principals and teachers in all treatment and control schools. These surveys 


10 Teachers of Kindergarten through grade 3 also participated in the study. This was done mainly to promote 
schoolwide engagement in the implementation of the classroom practice and principal leadership performance 
measures. These teachers are not included in the main study analyses, however, because student assessment data are 
not available in Kindergarten through grade 3. 
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collected information on the nature and frequency of perfonnance infonnation educators 
received and their perceptions of that infonnation. 

Data on the characteristics of study participants. To compare the characteristics of 
participants in the treatment and control groups, we collected data on school characteristics from 
the 2011-12 Common Core of Data and collected data on principals’, teachers’, and students’ 
characteristics from district administrative records. 

Analyses 

To examine the implementation of the performance measures, we describe the extent to which 
study participants received the training on the measures, carried out the performance 
measurement activities, and received performance infonnation and feedback as planned. We also 
examined the characteristics of the ratings teachers and principals received, including whether 
they distinguish between lower and higher performers. These analyses yielded the average rating 
scores, the percentage of ratings in each performance level, and the variation in the ratings across 
teachers and across principals. To assess whether the study’s intervention led to differences in 
educators’ experiences with performance measurement and feedback, we compared survey 
responses of teachers and principals in the treatment and control groups. 

Detailed Summary of Findings 

The following section provides additional infonnation about the extent to which each of the 
study measures was carried out as intended and whether the infonnation from the measures 
distinguished between lower- and higher-perfonning educators and thus could be used to identify 
educators in need of support. These analyses pertain only to teachers and principals in the 
treatment schools. This section also highlights the extent to which educators’ experiences with 
perfonnance information differed between the treatment and control groups. 

The Classroom Practice Measure and Feedback 

The teacher classroom practice measure was based on four classroom observations during the 
school year. For each teacher, one observation was to be conducted by a school administrator 
and the other three by observers hired by the study. After each observation, the observer was 
expected to prepare a standard report with both ratings and narrative justification and to discuss 
the report with the teacher during a feedback session. Both teachers and their principal had 
access to the standard report. 

How Many Observations Were Conducted and What Were Observers’ 
Qualifications? 

• Observers were trained and certified as planned. Nearly all observers (92 

percent for CLASS and 97 percent for FFT) completed all of the required training, which 
lasted three days for CLASS and four days for FFT. All observers passed the certification 
test, although it took multiple attempts to pass the test for half of the CLASS trainees and 
17 percent of the FFT trainees. 
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• The majority of teachers were observed the intended four times and 
received feedback. The majority of teachers (73 percent for CLASS and 95 percent 
for FFT) received all four observations, and the majority of teachers (57 percent for 
CLASS and 94 percent for FFT) also received all four feedback sessions, as intended. On 
average, teachers received 3.8 observations (3.7 for CLASS and 3.9 for FFT) and 3.7 
feedback sessions (3.5 for CLASS and 3.9 for FFT) during the first year of the study. 

What Were the Characteristics of the Classroom Practice Performance 
Information Provided? 

• CLASS reports provided separate scores for individual dimensions as well 
as the teacher’s overall score and a sense of how their performance 
compared with others; FFT reports provided only separate scores for 
individual dimensions. The CLASS reports included scores for 12 dimensions of 
teaching grouped into four teaching domains, as well as an overall score for the 
observation and a score for each domain (emotional support, classroom organization, 
instructional support, and student engagement). In addition, the CLASS reports included 
comparisons with the district average scores and the teacher’s prior scores. The FFT 
reports provided scores for up to 10 dimensions of teaching grouped into two teaching 
domains (classroom environment and instruction). The FFT reports did not include an 
overall score, domain scores, scores from past observations, or district average scores. 

• Most of the CLASS observation reports identified at least one dimension of 
classroom practice to improve and illustrated it with an example from the 
observation, but less than a quarter of FFT reports did so. The observers were 
required to write narrative text identifying at least one dimension of practice as a strength 
and one dimension for improvement. The majority of the observation reports (76 percent 
of CLASS reports and 71 percent of FFT reports) did so. In addition, three quarters of the 
CLASS reports supported the identified dimension(s) for improvement with at least one 
example from the observation, but less than a quarter (23 percent) of the FFT reports did 
so. 11 

• For both CLASS and FFT, observation scores were concentrated at the 
upper end of the scale, limiting the degree of differentiation between lower- 
and higher-performing teachers. Nearly all teachers had CLASS or FFT overall 
scores for a given observation window in the top two performance levels (more than 95 
percent of the CLASS scores and more than 85 percent of the FFT scores). Only a small 
percentage of the teachers had scores consistent with the lowest two performance levels 
(under 5 percent for CLASS and under 15 percent for FFT depending on the observation 
window). 12 (See exhibits ES. 1 and ES.2.) While most teachers had overall scores in the 
top two perfonnance levels, many teachers had dimension-level scores at different 
perfonnance levels (e.g., in the first window 61 percent of CLASS teachers and 69 
percent of FFT teachers received scores at multiple performance levels). 


11 The findings reported here are based on an analysis of 160 randomly selected reports. 

12 Teachers observed using the FFT instrument did not receive an overall score or overall performance level for each 
observation window. For analytic purposes, the study’s evaluation team calculated each teacher’s average score in 
each observation window based on the 1 to 4 rating for each dimension of practice. 


Executive Summary 


ES-6 


Year 1 Report 



• Teachers’ overall classroom observation scores, averaged across all four 
windows, contained measurement error, but provided some reliable 
information to distinguish between lower- and higher-performing teachers 
and were positively correlated with teacher value-added scores. Classroom 
observation scores averaged across the four observation windows had some reliability to 
help distinguish average teacher performance (reliability estimated between .42 and .50 
for CLASS and .69 and .75 for FFT). These estimates, while lower than conventional 
thresholds for measures used in research, are consistent with findings from other studies 
of classroom observation reliability. 13 In addition, the CLASS and FFT four-window 
average scores were positively, although weakly, associated with teachers’ prior-year 
value-added scores (correlations of .09 and .17, respectively). 14 

• Differences in a teacher's ratings across observations limited how much 
one could learn about persistent performance from a single observation. 

Less than half of the variation in teacher scores from a given observation window 
reflected stable classroom practice over the year. The reliability estimate for a single 
observation was .24 for CLASS scores and .49 for FFT scores, which indicates that 24 
percent of the variation in CLASS scores and 49 percent of the variation in FFT scores 
reflected stable practice over the year. 


13 See Casabianca et al. (2013); Ho and Kane (2013); Kane and Staiger (2012). 

14 Although the correlations between classroom observation overall scores and value-added scores were modest in 
magnitude, these correlations are consistent with the magnitudes found by other studies (Chaplin et al. 2014; Kane 
and Staiger 2012; Kane et al. 2011) and likely underestimate the strength of the true association because of 
measurement error in both the observation scores and the value-added scores. 
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Exhibit ES.1. Distribution of treatment teachers across performance levels based on 
CLASS overall scores, by observation window 
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Exhibit Reads: Of treatment teachers in CLASS districts observed in window 1 , 74 percent had a CLASS overall score at the highly 
effective performance level, 24 percent at the effective performance level, and 2 percent at the developing effectiveness 
performance level. Less than 1 percent of teachers had an overall score at the ineffective performance level. 

NOTE: Performance level distributions are based on teachers’ overall CLASS ratings in each window. Sample size = 262 teachers 
in window 1 , 307 teachers in window 2, 309 teachers in window 3, and 272 teachers in window 4. Reported percentages may not 
sum to 100 percent because of rounding. 

a Within a window, less than 1 percent of teachers had an overall score at the ineffective performance level. 

SOURCE: Teachstone Online System. 
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Exhibit ES.2. Distribution of treatment teachers across study-defined performance levels 
based on FFT overall scores, by observation window 
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Exhibit Reads: Of treatment teachers in FFT districts observed in window 1, 4 percent had an FFT overall score between 3.50 and 
4.00, 84 percent had a score between 2.50 and 3.49, and 12 percent had a score between 1.50 and 2.49. Less than one percent of 
teachers had an overall score below 1.50. 

NOTE: The distribution in each window is based on teachers’ FFT overall scores categorized into study-defined performance levels. 
To create the overall scores and performance levels, the study’s evaluation team first calculated an overall score by averaging the 
teacher’s ten FFT dimension scores, each of which was rated on a 1 to 4 scale. The overall scores were then categorized into 
study-defined performance levels by rounding them to the nearest whole number. This created four performance levels aligned with 
the FFT dimension scores. An FFT dimension score of 1 corresponds to unsatisfactory, 2 corresponds to basic, 3 corresponds to 
proficient, and 4 corresponds to distinguished. Average FFT scores and overall performance levels were not provided in the FFT 
reports teachers received. Sample size = 216 teachers in window 1,219 teachers in window 2, 220 teachers in window 3, and 217 
teachers in window 4. Reported percentages may not sum to 100 percent because of rounding. 
a Within a window, less than 1 percent of teachers had an overall score below 1 .50. 

SOURCE: Teachscape Online System. 


The Student Growth Measure 

The measure of student growth was designed to provide teachers with information on their 
contribution to student achievement, using value-added methods. Value added methods involve 
predicting the test score each student would have received, accounting for prior achievement and 
other characteristics, if the student had been taught by the average teacher in the district. A 
teacher’s value added score is obtained by comparing the average actual performance of the 
teacher’s students to the average of the students’ predicted scores. 

Teacher value-added scores were generated for all teachers of students in grades 4-8 
reading/English language arts and mathematics in each district using the achievement data for 
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the students that each teacher taught in the previous two years. 15 Individual teachers in the 
treatment schools in these grades were given access to a report on their scores during the first 
year of implementation. Treatment principals were also given access to a report that included 
their teachers’ student growth reports as well as school average value-added scores, overall and 
by subject and grade. 

Who Received the Student Growth Performance Information? 

• A large majority of teachers had sufficient data to produce student growth 
reports. Overall, student achievement data were sufficient to compute value-added 
scores and produce student growth reports for 80 percent of the teachers, who were in 
grades 4-8. 

• Although most teachers and principals participated in the student growth 
report training, less than half of the teachers and principals accessed their 
reports. Overall, 85 percent of teachers and 81 percent of principals participated in a 
webinar prior to the release of the student growth reports. The webinar oriented the 
participants to the value-added scores, the content of the student growth reports, and how 
to access them. The online reporting system showed that 40 percent of the teachers with 
value-added scores and 38 percent of the principals accessed their student growth reports. 

What Were the Characteristics of the Student Growth Performance Information 
Provided? 

• Student growth reports included school and individual teachers’ value- 
added scores. The teacher report included a teacher’s overall and subject-specific 
value-added scores (both reading/English language arts and mathematics for those who 
taught both subjects) with an indication of their percentile ranking relative to other 
teachers in the district, and the average teacher score in the district and school. All scores 
included confidence intervals/standard error information to indicate the precision of the 
estimated scores. Each teacher could also access a roster that included the number and 
names of students used to calculate their score. For each teacher in his or her school, the 
principal could view an overall value-added score, scores by subject and grade, and 
scores across time. Principals could also view school average scores overall and by 
subject and grade. 

• Many teachers with a student growth report had a value-added score that 
measurably differed from the district average, particularly in mathematics. 

The student growth reports available to teachers and principals included teachers’ value- 
added scores along with an 80 percent confidence interval, which could be used to 
detennine whether the scores were “measurably” different from the district’s average 
teacher. 16 For example, in mathematics, 25 percent of the teachers had a value-added 


15 A value-added score for a given subject was produced for a teacher only if the teacher had at least 10 students who 
had the necessary achievement data. 

16 The student growth reports used an 80 percent confidence interval (i.e., the range of scores that have an 80 percent 
chance of including the teacher’s “true” score) to identify scores that were “measurably” below or above average. 
This benchmark was selected in order to appropriately balance the risk of misclassifying a teacher who is actually 
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score that was considered measurably below the district average, and 28 percent had a 
score that was considered measurably above average. See exhibit ES.3. 


Exhibit ES.3. Distribution of treatment teachers based on whether their value-added 
score was considered measurably above or below the district average, by subject 
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Exhibit Reads: For treatment teachers with mathematics value-added scores, 28 percent had scores considered measurably above 
the district average. 

NOTE: Distributions of teachers are based on whether the 80 percent confidence interval for a teacher's value-added score was 
above or below the district average. To indicate the amount of uncertainty around each teacher’s score, the student growth reports 
included 80 percent confidence intervals, which showed the range of scores that have an 80 percent chance of including the 
teacher’s “true” score. This benchmark was selected in order to appropriately balance two types of risks within the context of an 
intervention designed to provide feedback on performance without explicit consequences such as promotion or dismissal: (1 ) the 
risk of misidentifying truly average teachers as below- or above-average, and (2) the risk of misidentifying teachers who were truly 
below- or above-average as average teachers. Sample size = 338 teachers with mathematics value-added scores and 321 teachers 
with reading/English language arts value-added scores. Reported percentages may not sum to 100 percent because of rounding. 
SOURCE: AIR value-added system. 


The Principal Leadership Measure and Feedback 

Feedback on principal leadership was based on the Vanderbilt Assessment of Leadership in 
Education (VAL-ED), a 360-degree survey assessment administered twice a year to principals, 
principal supervisors, and teachers. The VAL-ED includes six “core components” of principal 
performance: high standards for student learning, rigorous curriculum, quality instruction, 


average as above or below average, against the risk of misclassifying a teacher who is actually above or below 
average as average. One consideration in striking this balance was that the study districts agreed that the value- 
added scores would not be used for decisions with consequences for employment. This reduced the potential 
downside associated with misidentifying an average teacher as below average. 
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culture of learning and professional behavior, connections to external communities and 
perfonnance accountability. Principals are also rated on six “key processes”: planning, 
implementing, supporting, advocating, communicating, and monitoring. A report for each 
principal was generated after each administration of the VAL-ED, and the principal’s supervisor 
was expected to discuss the report with the principal in a feedback session. 

How Was the Principal Leadership Measure Implemented? 

• All principals and their supervisors received training on using VAL-ED. All 

principals and their supervisors participated in a two-hour VAL-ED training in summer 
2012. During the school year, all principals’ supervisors also received a one-hour training 
to prepare them to conduct the feedback sessions. In addition, teachers were offered a 
one-hour introduction to VAL-ED at the beginning of the school year, as well as an 
orientation webinar during the school year. 

• All VAL-ED reports incorporated input from the principal, the principal’s 
supervisor, and most teachers. All principals and their supervisors completed the 
VAL-ED rating fonn, and a high percentage of teachers in each treatment school (80 
percent in fall and 90 percent in spring on average) also completed the form. 

• All VAL-ED feedback sessions occurred as planned. In both fall and spring, all 
principals met with their supervisors to discuss their VAL-ED reports. Principal 
supervisors reported feedback sessions lasting on average 52 minutes in the fall and 46 
minutes in the spring. 

What Were the Characteristics of the Principal Leadership Performance 
Information Provided? 

• The VAL-ED reports present scores and performance levels, as well as 
percentile ranks, for each dimension of leadership. VAL-ED reports present an 
overall score, a score for each core component, and a score for each key process. For 
each of these 13 scores, the report additionally presents a perfonnance label and a 
percentile rank, relative to the principals included in a national VAL-ED field test. Each 
score (i.e., overall score, core component scores, and key process scores) is an average 
across the three respondent groups (i.e., principal, supervisor, and teachers), with each 
group weighted equally. The report additionally shows the scores received from each 
respondent group separately. 

• The VAL-ED ratings classified some principals as lower-performing and 
some as higher-performing. In the fall, principals’ overall scores were distributed 
across the four perfonnance levels (8 percent of principals were labeled distinguished, 22 
percent proficient, 43 percent basic, and 27 percent below basic). In the spring 
administration, half the principals received an overall score associated with a 
perfonnance level of proficient or distinguished and half received a score at the basic or 
below basic level. 17 (See exhibit ES.4.) 


17 The increase in average VAL-ED overall scores from the fall to spring is primarily a product of an increase in the 
principal self-ratings. Average ratings of principal leadership based on the three respondent groups were similar in 
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Exhibit ES.4. Distribution of treatment principals across performance levels based on 
VAL-ED overall scores, by assessment window 
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Exhibit Reads: In fall 2012, 8 percent of treatment principals had a VAL-ED overall score at the distinguished performance level, 22 
percent at the proficient level, 43 percent at the basic level, and 27 percent at the below basic level. 

NOTE: Performance level distributions are based on principals’ VAL-ED overall scores at each assessment window. The overall 
score is an average of the scores from the principal’s supervisor, teachers, and the principal’s own self-rated score, with each group 
weighted equally. Sample size = 63 principals for both fall 2012 and spring 2013. Reported percentages may not sum to 100 
percent because of rounding. 

SOURCE: Fall 2012 and Spring 2013 VAL-ED Surveys. 


• VAL-ED ratings provided by principals, supervisors, and teachers in the fall 
were often too different to form a reliable measure, but the spring ratings 
were consistent enough to distinguish between some lower- and higher- 
performing principals. To provide information about a principal’s overall 
effectiveness, the VAL-ED scores should communicate a consistent (i.e., reliable) message 
about the principal’s effectiveness across the three respondent groups (the principal, the 
principal’s supervisor, the principal’s teachers). Based on the literature on 360-degree 
surveys, we would expect correlations between respondent group scores between .25 
and .35. 18 In the fall, however, agreement among the three respondent groups’ overall 
scores was low, with correlations ranging from .06 to .27. In the spring, correlations were 
higher (between .26 and .38), and thus the reports provided a more consistent message 


the fall; however, in the spring, principal self-ratings were higher on average (3.76) than the ratings from their 
supervisors (3.50, p-value of the difference <.05) and teachers (3.57, p-value of the difference < .05). 

18 For the VAL-ED correlations, see Porter et al. (2010). For the literature on 360-degree surveys, see Conway and 
Huffcutt (1997). 
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about a principal’s effectiveness. Viewing and discussing the fall reports may have led 
principals and their supervisors to better align their ratings in the spring. 

Educators’ Performance Evaluation Experiences 

The study’s performance measures were intended to provide educators with perfonnance 
information that was more frequent, systematic, and useful as a guide for professional growth 
than the information that they normally receive. To assess whether this occurred, we compared 
the treatment and control groups’ responses on surveys administered in the spring. Teacher 
surveys were usually completed at the beginning of the last of the four observation windows. 
Principal surveys were completed prior to the spring VAL-ED feedback session, which usually 
occurred at the end of the school year. 

What Were Teachers’ Experiences? 

• Treatment teachers reported receiving more feedback on both their 
classroom practice and their students’ achievement growth than control 
teachers. Treatment teachers reported receiving more feedback sessions with ratings 
and a written narrative than control teachers (3.0 versus 0.7 instances). The average 
treatment teacher also received a larger amount of oral feedback than the average control 
teacher (80 minutes versus 18 minutes). Furthermore, relative to control teachers, 
treatment teachers were more likely to report receiving value-added scores (45 percent 
versus 24 percent) and less likely to report receiving test scores for individual students or 
classroom average scores. 19 

• Among those who reported receiving feedback, treatment teachers 
indicated somewhat more positive perceptions than control teachers about 
the information they received on their classroom practice but not about the 
information on their students’ achievement. Although most teachers in both 
treatment and control groups reported agreeing or strongly agreeing that the feedback on 
their classroom practice provided specific ideas about how to improve, treatment teachers 
were more likely to report so (87 percent versus 79 percent). Almost all teachers 
(approximately 92 percent) in both groups indicated that the feedback on classroom 
practice was a fair assessment of their performance. Control teachers were more likely 
than treatment teachers to report that the student achievement information they received 
was easy to understand (89 percent versus 78 percent). However, less than half of the 
teachers in both groups agreed or strongly agreed that the achievement information was a 
fair assessment of their performance (49 percent for treatment teachers and 43 percent for 
control teachers, not a statistically significant difference) or a fair indicator of teacher 
effectiveness for all teachers (40 percent for treatment teachers versus 29 percent for 
control teachers, a statistically significant difference). 


19 This finding should be interpreted with caution because some teachers may not have had a correct understanding 
of the term “value-added scores.” As a validity check, we compared treatment teachers’ responses with electronic 
records indicating which teachers had accessed their own value-added scores in the online system, and we found that 
34 percent of the treatment teachers who reported receiving value-added scores did not access their student growth 
reports in the online system, and 1 7 percent of treatment teachers who reported not receiving value-added scores 
actually accessed their online student growth reports. 
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What Were Principals’ Experiences? 

• Treatment principals reported receiving more feedback than control 
principals. Treatment principals reported receiving feedback more often than control 
principals (2.0 versus 1.4 instances) and more instances of oral feedback with ratings (1.0 
versus 0.4 instances). The average treatment principal also received more oral feedback 
than the average control principal (60 minutes versus 41 minutes). However, treatment 
principals were no more likely than control principals to report that their supervisors’ 
feedback focused on specific topics related to VAL-ED. 

• Among those who reported receiving feedback, most principals in both 
treatment and control schools had positive perceptions about the feedback 
they received. The majority (more than 70 percent) of the principals in both treatment 
and control schools agreed that the feedback they received was a fair assessment of their 
perfonnance, and approximately two thirds or more of the principals agreed that the 
feedback they received contained specific ideas for improving their performance. Among 
those who received feedback, there was no statistically significant difference between 
treatment and control principals in their perceptions of the feedback. 

Future Report 

This report focuses on findings from the first year of implementation of the study’s three 
perfonnance measures with feedback. Findings about the second year of implementation will be 
presented in the second-year study report. The second-year report also will present findings on 
the impact of the study’s perfonnance measures and feedback on teacher classroom practice, 
principal leadership, and student achievement. 
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